[RLlib] Add gradient checks to avoid `nan` gradients in TorchLearner. #47452
Conversation
Signed-off-by: simonsays1980 <[email protected]>
…in highly unstable training phases. This helps to keep the optimizer's internal state intact, which could get corrupted with many zero gradients. Furthermore, added better logging messages. Signed-off-by: simonsays1980 <[email protected]>
Signed-off-by: simonsays1980 <[email protected]>
@@ -176,7 +176,27 @@ def compute_gradients(
 def apply_gradients(self, gradients_dict: ParamDict) -> None:
     # Set the gradient of the parameters.
     for pid, grad in gradients_dict.items():
-        self._params[pid].grad = grad
+        # If updates should not be skipped, turn `nan` gradients to zero.
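For readers skimming the diff, here is a minimal sketch of what this new branch could look like (the `torch_skip_nan_gradients` flag name and the surrounding method shape are assumptions inferred from this conversation, not the verbatim PR code):

```python
import torch

def apply_gradients(self, gradients_dict) -> None:
    # Set the gradient of the parameters.
    for pid, grad in gradients_dict.items():
        # If updates should not be skipped, turn `nan` (and `inf`)
        # gradient entries to zero so the optimizer's internal state
        # (e.g. Adam's moment estimates) stays intact.
        if grad is not None and not self.config.torch_skip_nan_gradients:
            grad = torch.nan_to_num(grad, nan=0.0, posinf=0.0, neginf=0.0)
        self._params[pid].grad = grad
```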
Wait, I'm confused. We have this block here further below, which I think does the exact same thing: it skips the entire `optim.step()` in case any gradient is non-finite (`inf` or `nan`).
# `step` the optimizer (default), but only if all gradients are finite.
elif all(
param.grad is None or torch.isfinite(param.grad).all()
for group in optim.param_groups
for param in group["params"]
):
Can you check and see whether these two logics can be consolidated?
Kind of like this:
- If the user sets this flag (default=False), the optimizer will skip the update step entirely (+ warning raised by RLlib).
- If the user does NOT set this flag (default behavior), grads that are non-finite will be set to 0.0 (+ warning raised by RLlib).
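A hedged sketch of one way the two logics could be consolidated along those lines (the `torch_skip_nan_gradients` flag name and the `self._optimizers` container are illustrative assumptions, not the merged code):

```python
import logging
import torch

logger = logging.getLogger(__name__)

def apply_gradients(self, gradients_dict) -> None:
    for pid, grad in gradients_dict.items():
        # Default behavior: set non-finite gradient entries to 0.0 + warn.
        if grad is not None and not self.config.torch_skip_nan_gradients:
            if not torch.isfinite(grad).all():
                logger.warning(
                    f"Non-finite gradients for param {pid}; setting them to 0.0."
                )
                grad = torch.nan_to_num(grad, nan=0.0, posinf=0.0, neginf=0.0)
        self._params[pid].grad = grad

    for optim in self._optimizers:  # hypothetical container of torch optimizers
        if not self.config.torch_skip_nan_gradients:
            optim.step()
        # Flag set: `step` the optimizer only if all gradients are finite;
        # otherwise skip the entire update + warn.
        elif all(
            param.grad is None or torch.isfinite(param.grad).all()
            for group in optim.param_groups
            for param in group["params"]
        ):
            optim.step()
        else:
            logger.warning("Skipping optimizer step due to non-finite gradients.")
```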
…lution considers non-finite gradients and still gives the user the option to set such gradients to zero to keep the optimizer's internal state intact. Signed-off-by: simonsays1980 <[email protected]>
LGTM! Thanks @simonsays1980
…`. (ray-project#47452) Signed-off-by: ujjawal-khare <[email protected]>
Why are these changes needed?
If any gradients turn `nan` in TorchLearner, these gradients get added to the network's weights, and in turn the weights become `nan`, and so do all network outputs. As a result, training errors out and stops. This PR proposes a gradient check to only add gradients if they are sane. It either switches `nan` values in the gradients to zeros or skips an update entirely. The latter can be of advantage if a training phase encounters highly unstable policy updates (e.g., with highly explorative policies or during early stages of training). In such phases many gradients could turn `nan`, and this may lead to corrupted internal optimizer states (e.g., Adam).
Related issue number
#47451
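As a tiny, self-contained illustration of the failure mode described above (standalone PyTorch, not RLlib code): a single injected `nan` gradient poisons both the weights and Adam's moment buffers, so every subsequent forward pass returns `nan`:

```python
import torch

# One linear layer and an Adam optimizer.
net = torch.nn.Linear(4, 1)
optim = torch.optim.Adam(net.parameters(), lr=0.01)

# Simulate a backward pass that yields a nan gradient.
out = net(torch.randn(2, 4)).sum()
out.backward()
net.weight.grad[0, 0] = float("nan")  # inject a single nan gradient

optim.step()

# The nan spreads: the affected weight and Adam's running moments
# (exp_avg, exp_avg_sq) are now nan, so all later outputs are nan too.
print(net.weight)                                  # contains nan
print(torch.isnan(net(torch.randn(2, 4))).any())   # tensor(True)
```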
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.